{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Calculating Annotation Coverage\n",
    "This section shows how to calculate annotation coverage as described here:    \n",
    "\n",
    "    Annotation coverage of Gene Ontology (GO) terms to individual \n",
    "    gene products is high for human or model organisms: \n",
    "      * 87% of ~20k human protein-coding genes have GO annotations\n",
    "      * 76% of ~14k fly protein-coding genes have GO annotations\n",
    "    (Apr 27, 2016)\n",
    "\n",
    "## 1. Download associations\n",
    "NCBI's gene2go file contains annotations of GO terms to Entrez GeneIDs for over 35 different species. We are interested in human and fly which have the taxids 9606 and 7227 repectively."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  EXISTS: gene2go\n"
     ]
    }
   ],
   "source": [
    "# Get ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2go.gz\n",
    "from goatools.base import download_ncbi_associations\n",
    "gene2go = download_ncbi_associations()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Read associations\n",
    "\n",
    "### 2a. You can read the associations one species at a time..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from goatools.associations import read_ncbi_gene2go\n",
    "\n",
    "geneid2gos_human = read_ncbi_gene2go(gene2go, taxids=[9606])\n",
    "geneid2gos_fly = read_ncbi_gene2go(gene2go, taxids=[7227])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2b. Or you can read 'gene2go' once and load all species..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from collections import defaultdict, namedtuple\n",
    "\n",
    "taxid2asscs = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))\n",
    "geneid2gos_all = read_ncbi_gene2go(\n",
    "        gene2go, \n",
    "        taxids=[9606, 7227], \n",
    "        taxid2asscs=taxid2asscs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Import protein-coding information for human and fly"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from goatools.test_data.genes_NCBI_9606_ProteinCoding import GeneID2nt as GeneID2nt_human\n",
    "from goatools.test_data.genes_NCBI_7227_ProteinCoding import GeneID2nt as GeneID2nt_fly\n",
    "lst = [\n",
    "    (9606, GeneID2nt_human),\n",
    "    (7227, GeneID2nt_fly)\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Calculate Gene Ontology coverage\n",
    "Store GO coverage information for *human* and *fly* in the list, **cov_data**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "cov_data = []\n",
    "NtCov = namedtuple(\"NtCov\", \"taxid num_GOs num_covgenes coverage num_allgenes\")\n",
    "for taxid, pcGeneID2nt in lst:\n",
    "    # Get GeneID2GOs association for current species\n",
    "    geneid2gos = taxid2asscs[taxid]['GeneID2GOs']\n",
    "    # Restrict GeneID2GOs to only protein-coding genes for this report\n",
    "    pcgene_w_gos = set(geneid2gos.keys()).intersection(set(pcGeneID2nt.keys()))\n",
    "    num_pcgene_w_gos = len(pcgene_w_gos)\n",
    "    num_pc_genes = len(pcGeneID2nt)\n",
    "    # Number of GO terms annotated to protein-coding genes\n",
    "    gos_pcgenes = set()\n",
    "    for geneid in pcgene_w_gos:\n",
    "        gos_pcgenes |= geneid2gos[geneid]\n",
    "    # Print report data\n",
    "    cov_data.append(NtCov(\n",
    "        taxid = taxid,\n",
    "        num_GOs = len(gos_pcgenes),\n",
    "        num_covgenes = num_pcgene_w_gos,\n",
    "        coverage = 100.0*num_pcgene_w_gos/num_pc_genes,\n",
    "        num_allgenes = num_pc_genes))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5 Report Gene Ontology coverage for human and fly\n",
    "Print the *human* and *fly* GO coverage information that is stored in the list, **cov_data**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      " taxid    GOs GeneIDs Coverage\n",
      "------ ------ ------- ----------------------\n",
      "  9606 16,436  18,273  87% GO coverage of 20,913 protein-coding genes\n",
      "  7227  6,995  10,591  76% GO coverage of 13,919 protein-coding genes\n"
     ]
    }
   ],
   "source": [
    "from __future__ import print_function\n",
    "\n",
    "print(\" taxid    GOs GeneIDs Coverage\")\n",
    "print(\"------ ------ ------- ----------------------\")\n",
    "fmtstr = \"{TAXID:>6} {N:>6,} {M:>7,}  {COV:2.0f}% GO coverage of {TOT:,} protein-coding genes\"\n",
    "for nt in cov_data:\n",
    "    print(fmtstr.format(\n",
    "        TAXID = nt.taxid,\n",
    "        N = nt.num_GOs,\n",
    "        M = nt.num_covgenes,\n",
    "        COV = nt.coverage,\n",
    "        TOT = nt.num_allgenes))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Copyright (C) 2016, DV Klopfenstein, H Tang. All rights reserved."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
